A Look Into International Goalscoring

26 January, 2024

Categories: football, data analysis, pandas

Competitive play is the meat and potatoes of international football. Who are the best players at scoring consistently when it really counts? And is there any truth to "Back in my day, goals were few and far between!"? Join me in this brief analysis, where we'll take information on 43189 goals over 100 years of football to piece together the cold, hard facts (or rather, interpretations)!

The dataset in question describes each goal with 8 variables:

date        home_team  away_team  team     scorer           minute  own_goal  penalty
1916-07-02  Chile      Uruguay    Uruguay  José Piendibene  44.0    False     False
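
For anyone following along, here is a minimal sketch of how a table with these columns might be loaded into pandas. The post doesn't name the file, so none is assumed here: the CSV text is just the header and sample row above wrapped in `io.StringIO`, and in practice you would pass the path to your own copy of the dataset instead.

```python
import io

import pandas as pd

# Stand-in for the real file: the header and the single sample row shown above.
sample_csv = io.StringIO(
    "date,home_team,away_team,team,scorer,minute,own_goal,penalty\n"
    "1916-07-02,Chile,Uruguay,Uruguay,José Piendibene,44.0,False,False\n"
)

# parse_dates turns the 'date' column into datetimes so years can be extracted later.
goals = pd.read_csv(sample_csv, parse_dates=['date'])

print(goals.dtypes)
```

Parsing the dates up front makes it easy to pull out the year with `goals['date'].dt.year` when looking at trends over time.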

Who are the top scorers?

The code below groups the data by 'team' and 'scorer', then aggregates (.agg()) each group to count how many goals each player has scored:

# Count goals per (team, scorer) pair, then sort from most to fewest goals.
record_goalscorers = (
    goals.groupby(by=['team', 'scorer'])[['scorer']]
    .agg({'scorer': 'count'})
    .rename(columns={'scorer': 'goals_scored'})
    .reset_index()
    .sort_values(by=['goals_scored'], ascending=False)
)

It's interesting to see that Lionel Messi is \(5^{\text{th}}\) in this list. His goals in friendly matches push him up to \(3^{\text{rd}}\) place (with \(106\) goals) in the overall international tally as can be found here (accurate as of the date of this blogpost).

Similarly, Ali Daei from Iran shoots all the way up from \(7^{\text{th}}\) with \(49\) goals to a whopping \(108\) goals at \(2^{\text{nd}}\) place. Friendly matches really skew the perception!

Is there any general trend in the number of goals scored in competitive matches over time?

As usual, a picture (or rather, a plot) is worth a thousand words:

Overall, there's been a general upturn in the number of goals. This upturn likely reflects the expansion of international competitions, continental championships and qualifiers, all of which add fixtures to the calendar. There also seems to be some periodicity, which could be explained by there being more games in the qualifying period roughly a year before each major competition.
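
The plot itself isn't reproduced here, but the yearly counts behind it can be sketched as follows. This is an assumption about how a `goals_by_year` frame (the name used later in this post) might be built, not necessarily the exact code behind the figure. The handful of dates are toy values for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for the goals table: only the 'date' column matters here.
goals = pd.DataFrame({
    'date': pd.to_datetime([
        '1930-07-13', '1930-07-14', '1931-05-01',
        '1934-05-27', '1934-05-27', '1934-06-10',
    ])
})

# One row per year with the number of goals recorded in that year.
goals_by_year = (
    goals.assign(year=goals['date'].dt.year)
    .groupby('year')
    .size()
    .reset_index(name='goals_scored')
)

print(goals_by_year)
```

From there, something like `goals_by_year.plot.scatter(x='year', y='goals_scored')` (which needs matplotlib installed) would give a scatter plot of goals per year.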

Instead of a scatter plot, a bar-chart colour-coded by competitions may help to surface a pattern. To do so, I need to create another column that will serve as a reference point to colour-code the bars:

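def which_tournament(year):
    """Label a year by the major tournament played in it, if any."""
    # Euro 2020 was postponed and played in 2021.
    if year == 2021:
        return 'Euro'
    # Euros: every four years from 1960 up to 2016.
    if (2020 > year >= 1960) and ((year - 1960) % 4 == 0):
        return 'Euro'
    # World Cups: every four years from 1930, except 1942 and 1946 (not held).
    if (year >= 1930) and ((year - 1930) % 4 == 0) and year not in (1942, 1946):
        return 'World Cup'
    return 'No Tournament'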

goals_by_year['tournament'] = goals_by_year['year'].apply(which_tournament)

To go even more granular, a similar process can highlight every year that comes before, for example, a World Cup competition:
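
One way this might look, as a sketch rather than the code actually used for the chart: the hypothetical helper `is_pre_world_cup_year` flags any year that immediately precedes a World Cup. It assumes World Cups fall every four years from 1930, with 1942 and 1946 skipped because no tournament was held in those years.

```python
def is_world_cup_year(year):
    """True if a World Cup was held in `year` (none took place in 1942 or 1946)."""
    return year >= 1930 and (year - 1930) % 4 == 0 and year not in (1942, 1946)


def is_pre_world_cup_year(year):
    """True if the following year is a World Cup year."""
    return is_world_cup_year(year + 1)


# Years between 1928 and 1951 that come right before a World Cup.
flagged = [y for y in range(1928, 1952) if is_pre_world_cup_year(y)]
print(flagged)
```

Applied to a yearly table, the same helper could feed a new column via `.apply(is_pre_world_cup_year)` on the 'year' column, which can then be used to colour the bars.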

This seems like something that could be investigated more rigorously with the theory of time series.


Maybe goals aren't as hard to come by these days as previously thought!